@pl752

@pl752 pl752 commented Dec 11, 2025

I want to propose a set of changes aimed at improving performance, which I have implemented and used for some time in my (private) projects.
The main goal of these changes is to significantly reduce heap allocations by using stack allocation and the array pool, and by avoiding unnecessary allocations in the first place.
I have created a topic on the mailing list.
I would appreciate opinions and help with testing: I have been using these changes for a while without any anomalies, though I haven't run thorough tests against all versions (I am using an FB 3 server). The changes also shouldn't alter observable behavior.

@niekschoemaker
Contributor

Personally, most of the changes seem to make sense, but I would make the case that the Auth part becomes way too complex with these changes (I'm also not sure how often that code even runs; I suppose it runs once per connection, so it's probably not too hot of a path).

The other parts do seem to make sense, especially the ReaderWriter optimizations, as those run for each query.

Did you however happen to run the benchmarks against this to see what actual change it makes to performance?

@pl752
Author

pl752 commented Dec 12, 2025

Unfortunately, I haven't gotten to running benchmarks yet. However, the changes resulted in a significant reduction in CPU time and allocations in my application's performance profiling runs. I will try to perform more thorough benchmarks and correctness tests soon, when I have some free time.

@pl752
Author

pl752 commented Dec 12, 2025

Also, I agree that the auth part is a case of over-optimization and can be omitted. I just applied the same change pattern to everything I noticed that allocates temporary buffers. Optimizations for things which run once per session/connection aren't necessary.

@pl752
Author

pl752 commented Dec 12, 2025

Upd: I have run the Perf project I found in the solution (I don't know how representative it is). And yeah, the speed difference is pretty negligible; however, the reduction in allocations can be clearly observed.

Perf benchmarks
BenchmarkDotNet v0.15.8, Windows 10 (10.0.19044.6691/21H2/November2021Update)
AMD Ryzen 7 5800H with Radeon Graphics 3.20GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.101
  [Host]  : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3
  NuGet   : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3
  Project : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3

Jit=RyuJit  Platform=X64  Toolchain=.NET 8.0
WarmupCount=3

| Method  | Job     | BuildConfiguration | DataType             | Count | Mean        | Error     | StdDev    | Ratio | Gen0    | Allocated | Alloc Ratio |
|-------- |-------- |------------------- |--------------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|------------:|
| Execute | NuGet   | ReleaseNuGet       | bigint               | 100   | 20,322.3 us | 212.53 us | 188.40 us |  1.00 | 31.2500 |  307.4 KB |        1.00 |
| Execute | Project | Release            | bigint               | 100   | 20,160.8 us | 175.47 us | 146.52 us |  0.99 |       - | 237.61 KB |        0.77 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Fetch   | NuGet   | ReleaseNuGet       | bigint               | 100   |    482.7 us |   4.17 us |   3.90 us |  1.00 |  6.8359 |  56.64 KB |        1.00 |
| Fetch   | Project | Release            | bigint               | 100   |    484.2 us |   3.33 us |   2.78 us |  1.00 |  4.8828 |  40.35 KB |        0.71 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Execute | NuGet   | ReleaseNuGet       | varch(...) utf8 [30] | 100   | 20,406.9 us | 217.86 us | 193.12 us |  1.00 | 31.2500 | 311.34 KB |        1.00 |
| Execute | Project | Release            | varch(...) utf8 [30] | 100   | 20,251.5 us | 118.63 us | 110.97 us |  0.99 |       - | 238.43 KB |        0.77 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Fetch   | NuGet   | ReleaseNuGet       | varch(...) utf8 [30] | 100   |    490.7 us |   3.71 us |   3.47 us |  1.00 |  6.8359 |  60.51 KB |        1.00 |
| Fetch   | Project | Release            | varch(...) utf8 [30] | 100   |    494.8 us |   6.60 us |   5.85 us |  1.01 |  4.8828 |   41.1 KB |        0.68 |

// * Hints *
Outliers
  CommandBenchmark.Execute: NuGet   -> 1 outlier  was  removed (21.28 ms)
  CommandBenchmark.Execute: Project -> 2 outliers were removed (20.81 ms, 21.15 ms)
  CommandBenchmark.Fetch: Project   -> 2 outliers were removed (499.44 us, 507.61 us)
  CommandBenchmark.Execute: NuGet   -> 1 outlier  was  removed (21.71 ms)
  CommandBenchmark.Fetch: Project   -> 1 outlier  was  removed (528.00 us)

Also, Firebird 3 is used; the disk is an OEM Samsung NVMe 2 TB (PM9A1, aka OEM 980 Pro), with 32 GB of DDR4 RAM @ 3200 MT/s JEDEC, dual channel of course.

@pl752
Author

pl752 commented Dec 12, 2025

Upd2: Ran the tests with Firebird 3 (not embedded), so it still needs further testing with other versions (especially embedded, and batch operations in modern FB). There was an issue with boolean reading caused by _smallBuffer being used both for reading the useful bytes and for reading the padding (this doesn't affect types that don't get padded). Also, a small reduction in test run time was observed (24.1 -> 23.5 min, but without repeatability checks), and after the fix there were no changes in the passed/failed/skipped counts.
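
To illustrate the kind of bug described above, here is a minimal sketch. It is not the library's code: `SketchReader`, `ReadBooleanBuggy` and `ReadBooleanFixed` are hypothetical names, and the wire layout (one value byte followed by three pad bytes) is an assumption made for the example. Only the idea of a shared scratch field named `_smallBuffer` is taken from the comment.

```csharp
using System;
using System.IO;

// Assumed wire layout for the example: 1 value byte followed by 3 pad bytes.
var wire = new byte[] { 1, 0, 0, 0 };
Console.WriteLine(new SketchReader(new MemoryStream(wire)).ReadBooleanBuggy()); // False
Console.WriteLine(new SketchReader(new MemoryStream(wire)).ReadBooleanFixed()); // True

// Hypothetical reader, written only to show the shared-buffer pitfall.
sealed class SketchReader
{
    private readonly Stream _stream;
    private readonly byte[] _smallBuffer = new byte[8]; // shared scratch space

    public SketchReader(Stream stream) => _stream = stream;

    // Pitfall: the pad bytes land in the same buffer and overwrite the value
    // before it has been decoded.
    public bool ReadBooleanBuggy()
    {
        _stream.ReadExactly(_smallBuffer, 0, 1);
        ReadPad(3);
        return _smallBuffer[0] != 0;
    }

    // One possible fix: decode the value first, then consume the padding.
    public bool ReadBooleanFixed()
    {
        _stream.ReadExactly(_smallBuffer, 0, 1);
        var value = _smallBuffer[0] != 0;
        ReadPad(3);
        return value;
    }

    private void ReadPad(int length) => _stream.ReadExactly(_smallBuffer, 0, length);
}
```

Types whose encoded size is already a multiple of four never call the pad read, which is consistent with only padded types being affected.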

@pl752
Author

pl752 commented Dec 12, 2025

Upd3: Performed tests with the embedded engine; all passed.

@pl752
Author

pl752 commented Dec 12, 2025

Upd4:
TLDR: I wrote some benchmarks specific to my (unfortunately private) solution's queries. Changes in query execution timing are sometimes hard to measure because the FB3 engine is the main bottleneck in the test scenarios, even under ideal conditions (localhost with a fast CPU and NVMe). However, query creation/preparation seems to have benefited significantly, string operations got a massive boost from the rune conversion rework, and positive side effects on memory and local CPU time utilization can be observed.

Practical benchmark results
//Update multiple: Optimized (local_opt2)
| Method                                      | UpdateRows | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |----------- |------------:|----------:|----------:|--------:|----------:|
| Update_MainDeliveryById_Merge_RollbackAsync | 25         |  1,695.5 us |  33.42 us |  39.78 us |  3.9063 |  42.78 KB |
| Update_MainDeliveryById_Merge_RollbackAsync | 1000       | 42,000.3 us | 481.56 us | 426.89 us | 83.3333 | 867.56 KB |


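//Update multiple: Original (master)
| Method                                      | UpdateRows | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |----------- |------------:|----------:|----------:|--------:|----------:|
| Update_MainDeliveryById_Merge_RollbackAsync | 25         |  1,704.9 us |  33.17 us |  52.62 us |  3.9063 |  46.98 KB |
| Update_MainDeliveryById_Merge_RollbackAsync | 1000       | 42,416.1 us | 634.02 us | 593.06 us | 83.3333 |  985.1 KB |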


//Single insert/upsert: Optimized
| Select_LoadWBSellerAccountsAsync            | -         |    717.2 us |  14.08 us |  14.46 us |  1.9531 |  29.26 KB |
| Insert_Upsert_WbDocCache_RollbackAsync      | -         |    708.6 us |  12.61 us |  11.18 us |  3.9063 |  33.58 KB |


//Single insert/upsert: Original
| Select_LoadWBSellerAccountsAsync            | -         |    741.5 us |  14.50 us |  18.86 us |  3.9063 |   33.5 KB |
| Insert_Upsert_WbDocCache_RollbackAsync      | -         |    724.0 us |  13.94 us |  18.13 us |  3.9063 |  37.22 KB |


//Select multiple mixed (3 int, 1 literal char string): Optimized
| Method                              | Rows   | Mean           | Error        | StdDev        | Gen0       | Gen1      | Allocated    |
|------------------------------------ |------- |---------------:|-------------:|--------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10     |       741.1 us |     19.82 us |      57.83 us |          - |         - |     47.52 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100    |     3,507.2 us |     66.56 us |      81.74 us |          - |         - |     421.2 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000   |    31,362.5 us |  5,421.76 us |  15,986.17 us |          - |         - |   4078.15 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000  |   321,426.4 us | 17,596.63 us |  51,884.06 us |  4000.0000 |         - |  40710.48 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000 | 3,394,208.6 us | 67,371.50 us | 193,301.47 us | 49000.0000 | 9000.0000 | 407078.14 KB |


//Select multiple mixed (3 int, 1 literal char string): Original
	(Yes, a 1.09x to >3x difference in speed and 10x in allocation volume;
	when profiling, actually a ~100x difference in allocate/free event counters)
| Method                              | Rows   | Mean         | Error      | StdDev      | Median       | Gen0        | Gen1        | Allocated     |
|------------------------------------ |------- |-------------:|-----------:|------------:|-------------:|------------:|------------:|--------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10     |     1.611 ms |  0.0506 ms |   0.1453 ms |     1.604 ms |           - |           - |     457.27 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100    |    11.138 ms |  0.1882 ms |   0.1760 ms |    11.129 ms |           - |           - |    4511.55 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000   |    34.017 ms |  4.9421 ms |  14.4162 ms |    25.360 ms |   5000.0000 |   1000.0000 |   44988.63 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000  |   346.085 ms | 23.2037 ms |  68.4167 ms |   337.300 ms |  55000.0000 |  11000.0000 |  449544.78 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000 | 3,709.593 ms | 73.9280 ms | 194.7560 ms | 3,695.932 ms | 550000.0000 | 110000.0000 | 4494710.22 KB |

//Select multiple int only (3 int): Optimized
| Method                              | Rows    | Mean           | Error       | StdDev      | Gen0       | Gen1      | Allocated    |
|------------------------------------ |-------- |---------------:|------------:|------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10      |       376.7 us |    17.69 us |    50.19 us |          - |         - |     11.73 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100     |     1,102.8 us |    54.87 us |   160.06 us |          - |         - |     63.99 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000    |     4,537.8 us |   689.68 us | 2,033.53 us |          - |         - |    497.61 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000   |    18,131.8 us |   137.98 us |   115.22 us |          - |         - |   4927.88 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000  |   176,431.1 us |   956.29 us |   798.54 us |  6000.0000 |         - |  49230.36 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000000 | 1,743,465.8 us | 7,326.00 us | 6,494.31 us | 60000.0000 | 6000.0000 | 497846.96 KB |

//Select multiple int only (3 int): Original
| Method                              | Rows    | Mean           | Error       | StdDev      | Median         | Gen0       | Gen1      | Allocated    |
|------------------------------------ |-------- |---------------:|------------:|------------:|---------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10      |       357.9 us |     9.24 us |    25.44 us |       355.4 us |          - |         - |     12.89 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100     |     1,182.7 us |    49.32 us |   142.29 us |     1,158.6 us |          - |         - |     70.78 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000    |     4,541.1 us |   718.80 us | 2,119.41 us |     3,485.7 us |          - |         - |    561.27 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000   |    18,280.0 us |   340.21 us |   454.17 us |    18,246.9 us |          - |         - |   5561.08 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000  |   173,885.9 us |   916.01 us |   764.91 us |   173,896.6 us |  6000.0000 |         - |  55558.87 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000000 | 1,745,630.3 us | 4,109.23 us | 3,642.73 us | 1,745,432.2 us | 68000.0000 | 7000.0000 | 561128.53 KB |

It was a little tricky to actually obtain measurements that show the improvements; however, some interesting observations can be made.
The main explanation for the small timing improvements is that the DB engine seems to use a whole CPU core's time while the application thread is idle most of the time. This is despite my benchmarks doing pretty much nothing aside from opening the connection, opening a configured transaction, creating queries, filling in parameters, preparing (if run multiple times in a row), executing/reading, mapping the selected fields to a single instance of a structure (to avoid performance noise as much as possible), rolling back the transaction, and closing the connection.
However, string reading benefited heavily from the optimizations, which reduced the overall number of allocated objects 10-100x. The cause was the original rune char enumerator, which allocated every (!) rune as a separate char array, resulting in tens of millions of char[1] and char[2] objects being allocated and then collected shortly after, while the new methods avoid allocation as much as possible. The situation was made worse by the original rune counting method, which ran the full enumeration, creating all the char arrays, and then simply counted them without ever using the char data itself. Reducing allocations to the final buffers and strings saves a lot of CPU time (heap allocation is not a cheap operation even in .NET, and during string conversions the client library actually becomes the bottleneck instead of the engine).
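
A minimal sketch of the difference between the two counting styles, assuming nothing about the actual implementation: the method names echo the benchmark labels (`EnumerateRunesToChars`, `CountRunes`), but the bodies and the `RuneSketch` class are written here purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// 'a', one surrogate pair, 'b': 4 chars but 3 runes.
var text = "a\U0001F600b";
Console.WriteLine(RuneSketch.CountViaCharArrays(text)); // 3
Console.WriteLine(RuneSketch.CountRunes(text));         // 3

static class RuneSketch
{
    // Allocation-heavy style: every rune becomes its own char[1] or char[2].
    public static IEnumerable<char[]> EnumerateRunesToChars(string s)
    {
        for (var i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                yield return new[] { s[i], s[i + 1] };
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                yield return new[] { s[i] };
            }
        }
    }

    // Counting through the enumerator allocates one array per rune and discards it.
    public static int CountViaCharArrays(string s) => EnumerateRunesToChars(s).Count();

    // Allocation-free style: walk the span and count, treating a valid
    // surrogate pair as a single rune.
    public static int CountRunes(ReadOnlySpan<char> s)
    {
        var count = 0;
        for (var i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++;
            }
            count++;
        }
        return count;
    }
}
```

Both methods return the same count; the second one only reads the existing characters and creates no intermediate objects.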
Also, the 10x difference in memory volume when working with strings arises because each char[1]/char[2] array carries not only 2-4 bytes of useful raw data, but also 0-6 bytes of padding (in some cases) and 8-16 bytes of array object metadata (effectively a Span, i.e. a pointer to the real data and the array length), and that's not counting the object type and reference management data.
Also, the tests of queries over small numbers of rows usually yielded bigger percentage improvements (1 to 4% and 9 to 100+%), as, I think, the better string processing helped the query and parameter preparation phase.
Also, the timings are not the whole story, as the changes had some pretty beneficial side effects. The reduced number of allocations of course reduces the number of times the GC runs. stackalloc is also essentially free (it is not a complex allocator function, but rather a tiny sub esp, size ... add esp, size). There is also a reduction in CPU time used, observable even without the profiler: I could clearly see the main thread using 2-3% (5-6% during select with char) of the whole CPU, while the optimized version consumed only 1-3%. This means that on low-end client systems, or when the application heavily uses the thread pool, the DB reading task will occupy the thread less, leaving more time for other tasks when the pool is exhausted and the queue is used, and for other programs on low-end or heavily loaded machines, in theory.
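
As a rough sketch of the buffer pattern discussed here (stack memory for small temporaries, the array pool for larger ones): `BufferSketch`, `ReadInt32`, `SumPayload` and the `StackLimit` threshold are hypothetical names and values chosen for the example, not the code from the PR. It relies on `Stream.ReadExactly`, which is available from .NET 7.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.IO;

// In-memory streams stand in for the network stream.
Console.WriteLine(BufferSketch.ReadInt32(new MemoryStream(new byte[] { 0, 0, 0, 42 }))); // 42
Console.WriteLine(BufferSketch.SumPayload(new MemoryStream(new byte[] { 1, 2, 3 }), 3)); // 6

static class BufferSketch
{
    // Hypothetical threshold: payloads up to this size use stack memory.
    private const int StackLimit = 256;

    // Fixed-size read: a 4-byte stack buffer, no heap allocation at all.
    public static int ReadInt32(Stream stream)
    {
        Span<byte> buffer = stackalloc byte[4];
        stream.ReadExactly(buffer);
        return BinaryPrimitives.ReadInt32BigEndian(buffer); // XDR integers are big-endian
    }

    // Variable-size read: stack for small payloads, pooled array for large ones.
    public static int SumPayload(Stream stream, int length)
    {
        byte[]? rented = null;
        Span<byte> buffer = length <= StackLimit
            ? stackalloc byte[StackLimit]
            : (rented = ArrayPool<byte>.Shared.Rent(length));
        try
        {
            buffer = buffer.Slice(0, length); // rented arrays may be longer than requested
            stream.ReadExactly(buffer);
            var sum = 0;
            foreach (var b in buffer)
            {
                sum += b;
            }
            return sum;
        }
        finally
        {
            if (rented is not null)
            {
                ArrayPool<byte>.Shared.Return(rented);
            }
        }
    }
}
```

The pooled path has a fixed rent/return cost, so it only pays off for buffers that are too large for the stack.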
Also, the lack of proper benchmark/test coverage is because the rework started as a small experiment out of curiosity, when I noticed that Firebird was the top 1-2 consumer of CPU time in my application. I then decided that the experiment was pretty successful and that the contribution might be useful for other developers and their solutions, so I decided to reach out with the proposal.

@pl752
Author

pl752 commented Dec 14, 2025

Upd5:
TLDR: Implemented a set of synthetic benchmarks to test the main changes themselves (isolated from IO), along with some alternative implementations of the read/write methods. Huge improvements in the rune processing methods. Small performance gains and reduced allocations in the synchronous methods; some performance degradation in the async writing methods for small data types (which shouldn't be noticeable or measurable on the scale of a whole DB query/action), with reduced allocations for most of the async operations. Significant improvements for large string operations.

Note: in my free time I have further tweaked XdrReaderWriter's buffer handling and simplified the AuthBlock code.

Rune op benchmarks
// * Summary *

BenchmarkDotNet v0.15.8, Windows 10 (10.0.19044.6691/21H2/November2021Update)
AMD Ryzen 7 5800H with Radeon Graphics 3.20GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.101
  [Host]     : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3
  Job-JMDAGQ : .NET 10.0.1 (10.0.1, 10.0.125.57005), X64 RyuJIT x86-64-v3
  Job-OOTPKI : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3


| Method                                       | Job        | Toolchain | Kind                 | RuneLength | MaxRuneCount | Mean        | Error     | StdDev    | Gen0   | Allocated |
|--------------------------------------------- |----------- |---------- |--------------------- |----------- |------------- |------------:|----------:|----------:|-------:|----------:|
| 'old truncate via EnumerateRunesToChars'     | Job-JMDAGQ | .NET 10.0 | Ascii                | 128        | 512          |  2,490.0 ns |  39.22 ns |  36.69 ns | 1.0414 |    8712 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-JMDAGQ | .NET 10.0 | Ascii                | 128        | 512          |    110.8 ns |   2.15 ns |   1.91 ns | 0.0334 |     280 B |
| 'old truncate via EnumerateRunesToChars'     | Job-OOTPKI | .NET 8.0  | Ascii                | 128        | 512          |  3,002.9 ns |  54.82 ns |  51.27 ns | 1.0414 |    8712 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-OOTPKI | .NET 8.0  | Ascii                | 128        | 512          |    108.2 ns |   1.94 ns |   1.81 ns | 0.0334 |     280 B |
| 'old truncate via EnumerateRunesToChars'     | Job-JMDAGQ | .NET 10.0 | Ascii                | 1024       | 512          |  9,842.2 ns | 196.62 ns | 201.91 ns | 4.0588 |   34056 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-JMDAGQ | .NET 10.0 | Ascii                | 1024       | 512          |    376.1 ns |   4.77 ns |   4.46 ns | 0.1249 |    1048 B |
| 'old truncate via EnumerateRunesToChars'     | Job-OOTPKI | .NET 8.0  | Ascii                | 1024       | 512          | 11,843.6 ns | 178.52 ns | 166.99 ns | 4.0588 |   34056 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-OOTPKI | .NET 8.0  | Ascii                | 1024       | 512          |    423.3 ns |   7.58 ns |   6.72 ns | 0.1249 |    1048 B |
| 'old truncate via EnumerateRunesToChars'     | Job-JMDAGQ | .NET 10.0 | Mixed(...)gates [21] | 128        | 512          |  2,749.5 ns |  53.15 ns |  49.72 ns | 1.0490 |    8776 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-JMDAGQ | .NET 10.0 | Mixed(...)gates [21] | 128        | 512          |    133.1 ns |   2.38 ns |   2.11 ns | 0.0410 |     344 B |
| 'old truncate via EnumerateRunesToChars'     | Job-OOTPKI | .NET 8.0  | Mixed(...)gates [21] | 128        | 512          |  3,315.5 ns |  61.21 ns |  57.26 ns | 1.0490 |    8776 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-OOTPKI | .NET 8.0  | Mixed(...)gates [21] | 128        | 512          |    126.6 ns |   1.03 ns |   0.91 ns | 0.0410 |     344 B |
| 'old truncate via EnumerateRunesToChars'     | Job-JMDAGQ | .NET 10.0 | Mixed(...)gates [21] | 1024       | 512          | 10,957.8 ns | 157.33 ns | 139.47 ns | 4.0894 |   34312 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-JMDAGQ | .NET 10.0 | Mixed(...)gates [21] | 1024       | 512          |    453.9 ns |   8.40 ns |   7.86 ns | 0.1554 |    1304 B |
| 'old truncate via EnumerateRunesToChars'     | Job-OOTPKI | .NET 8.0  | Mixed(...)gates [21] | 1024       | 512          | 12,647.8 ns | 113.42 ns | 100.54 ns | 4.0894 |   34312 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-OOTPKI | .NET 8.0  | Mixed(...)gates [21] | 1024       | 512          |    449.2 ns |   5.65 ns |   5.01 ns | 0.1554 |    1304 B |
| 'old truncate via EnumerateRunesToChars'     | Job-JMDAGQ | .NET 10.0 | MostlySurrogates     | 128        | 512          |  3,225.2 ns |  28.44 ns |  25.21 ns | 1.0681 |    8936 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-JMDAGQ | .NET 10.0 | MostlySurrogates     | 128        | 512          |    165.1 ns |   3.34 ns |   5.67 ns | 0.0601 |     504 B |
| 'old truncate via EnumerateRunesToChars'     | Job-OOTPKI | .NET 8.0  | MostlySurrogates     | 128        | 512          |  3,867.0 ns |  40.20 ns |  37.61 ns | 1.0681 |    8936 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-OOTPKI | .NET 8.0  | MostlySurrogates     | 128        | 512          |    161.0 ns |   3.25 ns |   7.84 ns | 0.0601 |     504 B |
| 'old truncate via EnumerateRunesToChars'     | Job-JMDAGQ | .NET 10.0 | MostlySurrogates     | 1024       | 512          | 13,368.1 ns | 239.05 ns | 211.91 ns | 4.1656 |   34952 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-JMDAGQ | .NET 10.0 | MostlySurrogates     | 1024       | 512          |    584.4 ns |  10.15 ns |   9.00 ns | 0.2317 |    1944 B |
| 'old truncate via EnumerateRunesToChars'     | Job-OOTPKI | .NET 8.0  | MostlySurrogates     | 1024       | 512          | 15,874.9 ns | 213.65 ns | 199.85 ns | 4.1656 |   34952 B |
| 'new TruncateStringToRuneCount().ToString()' | Job-OOTPKI | .NET 8.0  | MostlySurrogates     | 1024       | 512          |    592.7 ns |  11.31 ns |  10.58 ns | 0.2317 |    1944 B |
  
| Method                                   | Job        | Toolchain | Kind                 | RuneLength | Mean         | Error        | StdDev       | Median       | Gen0    | Allocated |
|----------------------------------------- |----------- |---------- |--------------------- |----------- |-------------:|-------------:|-------------:|-------------:|--------:|----------:|
| 'old Count() over EnumerateRunesToChars' | Job-JMDAGQ | .NET 10.0 | Ascii                | 128        |    619.63 ns |    11.920 ns |    11.707 ns |    620.95 ns |  0.5035 |    4216 B |
| 'new CountRunes(span)'                   | Job-JMDAGQ | .NET 10.0 | Ascii                | 128        |     64.97 ns |     0.332 ns |     0.294 ns |     64.97 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-OOTPKI | .NET 8.0  | Ascii                | 128        |    754.45 ns |    14.865 ns |    31.678 ns |    745.93 ns |  0.5035 |    4216 B |
| 'new CountRunes(span)'                   | Job-OOTPKI | .NET 8.0  | Ascii                | 128        |     66.31 ns |     0.209 ns |     0.196 ns |     66.30 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-JMDAGQ | .NET 10.0 | Ascii                | 8192       | 35,601.63 ns |   693.533 ns | 1,886.808 ns | 34,675.31 ns | 31.3110 |  262264 B |
| 'new CountRunes(span)'                   | Job-JMDAGQ | .NET 10.0 | Ascii                | 8192       |  3,909.67 ns |    23.210 ns |    21.711 ns |  3,901.30 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-OOTPKI | .NET 8.0  | Ascii                | 8192       | 46,619.68 ns |   920.902 ns | 1,752.112 ns | 46,408.94 ns | 31.3110 |  262264 B |
| 'new CountRunes(span)'                   | Job-OOTPKI | .NET 8.0  | Ascii                | 8192       |  3,924.67 ns |    15.360 ns |    14.368 ns |  3,921.15 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-JMDAGQ | .NET 10.0 | Mixed(...)gates [21] | 128        |    649.58 ns |    11.920 ns |    11.150 ns |    651.50 ns |  0.5035 |    4216 B |
| 'new CountRunes(span)'                   | Job-JMDAGQ | .NET 10.0 | Mixed(...)gates [21] | 128        |     74.12 ns |     1.125 ns |     0.939 ns |     74.00 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-OOTPKI | .NET 8.0  | Mixed(...)gates [21] | 128        |    745.95 ns |    14.848 ns |    14.583 ns |    749.10 ns |  0.5035 |    4216 B |
| 'new CountRunes(span)'                   | Job-OOTPKI | .NET 8.0  | Mixed(...)gates [21] | 128        |     81.77 ns |     0.479 ns |     0.374 ns |     81.83 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-JMDAGQ | .NET 10.0 | Mixed(...)gates [21] | 8192       | 38,241.41 ns |   592.292 ns |   462.423 ns | 38,313.08 ns | 31.3110 |  262264 B |
| 'new CountRunes(span)'                   | Job-JMDAGQ | .NET 10.0 | Mixed(...)gates [21] | 8192       |  4,373.03 ns |    11.984 ns |    10.623 ns |  4,372.00 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-OOTPKI | .NET 8.0  | Mixed(...)gates [21] | 8192       | 45,407.27 ns |   876.842 ns | 1,140.142 ns | 45,305.08 ns | 31.3110 |  262264 B |
| 'new CountRunes(span)'                   | Job-OOTPKI | .NET 8.0  | Mixed(...)gates [21] | 8192       |  4,841.84 ns |    13.875 ns |    12.300 ns |  4,844.33 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-JMDAGQ | .NET 10.0 | MostlySurrogates     | 128        |    715.61 ns |    13.934 ns |    12.352 ns |    718.27 ns |  0.5035 |    4216 B |
| 'new CountRunes(span)'                   | Job-JMDAGQ | .NET 10.0 | MostlySurrogates     | 128        |     92.80 ns |     0.554 ns |     0.492 ns |     92.63 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-OOTPKI | .NET 8.0  | MostlySurrogates     | 128        |    814.53 ns |    16.191 ns |    31.579 ns |    796.44 ns |  0.5035 |    4216 B |
| 'new CountRunes(span)'                   | Job-OOTPKI | .NET 8.0  | MostlySurrogates     | 128        |     97.37 ns |     0.291 ns |     0.258 ns |     97.33 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-JMDAGQ | .NET 10.0 | MostlySurrogates     | 8192       | 43,324.08 ns |   835.526 ns |   781.551 ns | 43,488.72 ns | 31.3110 |  262264 B |
| 'new CountRunes(span)'                   | Job-JMDAGQ | .NET 10.0 | MostlySurrogates     | 8192       |  5,585.16 ns |    25.642 ns |    21.412 ns |  5,584.12 ns |       - |         - |
| 'old Count() over EnumerateRunesToChars' | Job-OOTPKI | .NET 8.0  | MostlySurrogates     | 8192       | 50,161.87 ns | 1,000.371 ns | 1,466.331 ns | 50,378.08 ns | 31.3110 |  262264 B |
| 'new CountRunes(span)'                   | Job-OOTPKI | .NET 8.0  | MostlySurrogates     | 8192       |  5,831.10 ns |    30.611 ns |    27.135 ns |  5,827.66 ns |       - |         - |

Of course, there are order-of-magnitude improvements when going from per-rune allocations to allocation-free methods (aside from instantiating the string from the truncated span), as allocating and garbage-collecting a large number of small objects is quite expensive.
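
As a quick sanity check on the per-rune cost, the Allocated figures of the 'old Count() over EnumerateRunesToChars' rows (Ascii, 128 and 8192 runes) can be compared directly. The snippet below only does arithmetic on numbers copied from the table; the split into a per-rune part and a fixed part is an interpretation, not a measurement.

```csharp
using System;

// Allocated bytes copied from the table above (Ascii rows, old Count()).
const long allocatedAt128 = 4216;
const long allocatedAt8192 = 262264;

// Marginal cost of one additional rune, and what remains as a fixed cost.
long perRune = (allocatedAt8192 - allocatedAt128) / (8192 - 128);
long fixedPart = allocatedAt128 - perRune * 128;

Console.WriteLine($"{perRune} B per rune, {fixedPart} B fixed"); // 32 B per rune, 120 B fixed
```

So each rune costs about 32 bytes of heap in the old counting path, against 2 bytes of actual character data for ASCII.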

Benchmark of buffer handling in XdrReaderWriter
| Method                                          | Job        | Toolchain | Mean       | Error     | StdDev    | Gen0   | Allocated |
|------------------------------------------------ |----------- |---------- |-----------:|----------:|----------:|-------:|----------:|
| 'master ReadBoolean()'                          | Job-JMDAGQ | .NET 10.0 |  1.8462 ns | 0.0212 ns | 0.0198 ns |      - |         - |
| 'local_opt2 ReadBoolean() (shared)'             | Job-JMDAGQ | .NET 10.0 |  0.8733 ns | 0.0127 ns | 0.0119 ns |      - |         - |
| 'local_opt2 ReadBoolean() (stackalloc)'         | Job-JMDAGQ | .NET 10.0 |  0.5557 ns | 0.0257 ns | 0.0228 ns |      - |         - |
| 'local_opt2 ReadBoolean() (rent always)'        | Job-JMDAGQ | .NET 10.0 |  9.2055 ns | 0.0498 ns | 0.0466 ns |      - |         - |
| 'local_opt2 ReadBoolean() (stackalloc, clear)'  | Job-JMDAGQ | .NET 10.0 |  0.7566 ns | 0.0080 ns | 0.0071 ns |      - |         - |
| 'local_opt2 ReadBoolean() (rent always, clear)' | Job-JMDAGQ | .NET 10.0 | 10.3027 ns | 0.0861 ns | 0.0719 ns |      - |         - |
| 'master ReadBoolean()'                          | Job-OOTPKI | .NET 8.0  |  5.2645 ns | 0.1155 ns | 0.1330 ns | 0.0038 |      32 B |
| 'local_opt2 ReadBoolean() (shared)'             | Job-OOTPKI | .NET 8.0  |  0.9108 ns | 0.0294 ns | 0.0260 ns |      - |         - |
| 'local_opt2 ReadBoolean() (stackalloc)'         | Job-OOTPKI | .NET 8.0  |  0.6007 ns | 0.0082 ns | 0.0077 ns |      - |         - |
| 'local_opt2 ReadBoolean() (rent always)'        | Job-OOTPKI | .NET 8.0  | 21.8549 ns | 0.1213 ns | 0.1134 ns |      - |         - |
| 'local_opt2 ReadBoolean() (stackalloc, clear)'  | Job-OOTPKI | .NET 8.0  |  0.5880 ns | 0.0125 ns | 0.0116 ns |      - |         - |
| 'local_opt2 ReadBoolean() (rent always, clear)' | Job-OOTPKI | .NET 8.0  | 20.7321 ns | 0.0914 ns | 0.0810 ns |      - |         - |

| Method                                                    | Job        | Toolchain | Mean       | Error     | StdDev    | Allocated |
|---------------------------------------------------------- |----------- |---------- |-----------:|----------:|----------:|----------:|
| 'master ReadInt32() using shared _smallBuffer'            | Job-JMDAGQ | .NET 10.0 |  1.0302 ns | 0.0362 ns | 0.0339 ns |         - |
| 'local_opt2 ReadInt32() (stackalloc)'                     | Job-JMDAGQ | .NET 10.0 |  0.5182 ns | 0.0056 ns | 0.0050 ns |         - |
| 'local_opt2 ReadInt32() renting buffer'                   | Job-JMDAGQ | .NET 10.0 |  9.5122 ns | 0.0642 ns | 0.0536 ns |         - |
| 'local_opt2 ReadInt32() using shared _smallBuffer, clear' | Job-JMDAGQ | .NET 10.0 |  3.0379 ns | 0.0208 ns | 0.0195 ns |         - |
| 'local_opt2 ReadInt32() (stackalloc, clear)'              | Job-JMDAGQ | .NET 10.0 |  0.5417 ns | 0.0145 ns | 0.0128 ns |         - |
| 'local_opt2 ReadInt32() renting buffer, clear'            | Job-JMDAGQ | .NET 10.0 | 10.3372 ns | 0.0458 ns | 0.0406 ns |         - |
| 'master ReadInt32() using shared _smallBuffer'            | Job-OOTPKI | .NET 8.0  |  0.9552 ns | 0.0122 ns | 0.0114 ns |         - |
| 'local_opt2 ReadInt32() (stackalloc)'                     | Job-OOTPKI | .NET 8.0  |  0.5161 ns | 0.0083 ns | 0.0074 ns |         - |
| 'local_opt2 ReadInt32() renting buffer'                   | Job-OOTPKI | .NET 8.0  | 24.9393 ns | 0.0835 ns | 0.0781 ns |         - |
| 'local_opt2 ReadInt32() using shared _smallBuffer, clear' | Job-OOTPKI | .NET 8.0  |  4.5371 ns | 0.0779 ns | 0.0691 ns |         - |
| 'local_opt2 ReadInt32() (stackalloc, clear)'              | Job-OOTPKI | .NET 8.0  |  0.7336 ns | 0.0082 ns | 0.0077 ns |         - |
| 'local_opt2 ReadInt32() renting buffer, clear'            | Job-OOTPKI | .NET 8.0  | 21.4044 ns | 0.1785 ns | 0.1490 ns |         - |

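| Method                                                                | Job        | Toolchain |     Mean |     Error |    StdDev | Allocated |
|---------------------------------------------------------------------- |----------- |---------- |---------:|----------:|----------:|----------:|
| 'local_opt2 ReadInt32() using shared _smallBuffer, clear with AsSpan' | Job-JMDAGQ | .NET 10.0 | 1.113 ns | 0.0208 ns | 0.0194 ns |         - |
| 'local_opt2 ReadInt32() using shared _smallBuffer, clear with AsSpan' | Job-OOTPKI | .NET 8.0  | 1.131 ns | 0.0215 ns | 0.0202 ns |         - |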

| Method                                                  | Job        | Toolchain | Varying | Mean      | Error     | StdDev    | Gen0   | Allocated |
|-------------------------------------------------------- |----------- |---------- |-------- |----------:|----------:|----------:|-------:|----------:|
| 'master ReadGuid(int)'                                  | Job-JMDAGQ | .NET 10.0 | False   | 10.333 ns | 0.1596 ns | 0.1333 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int)'                              | Job-JMDAGQ | .NET 10.0 | False   |  5.502 ns | 0.0196 ns | 0.0183 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (shared _smallBuffer)'        | Job-JMDAGQ | .NET 10.0 | False   |  7.343 ns | 0.0338 ns | 0.0316 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (shared _smallBuffer, clear)' | Job-JMDAGQ | .NET 10.0 | False   |  9.989 ns | 0.0497 ns | 0.0440 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (rent always)'                | Job-JMDAGQ | .NET 10.0 | False   | 14.747 ns | 0.0621 ns | 0.0581 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (stackalloc, clear)'          | Job-JMDAGQ | .NET 10.0 | False   |  6.005 ns | 0.0324 ns | 0.0303 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (rent always, clear)'         | Job-JMDAGQ | .NET 10.0 | False   | 15.693 ns | 0.0667 ns | 0.0591 ns |      - |         - |
| 'master ReadGuid(int)'                                  | Job-OOTPKI | .NET 8.0  | False   | 10.727 ns | 0.1286 ns | 0.1140 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int)'                              | Job-OOTPKI | .NET 8.0  | False   |  5.616 ns | 0.0167 ns | 0.0139 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (shared _smallBuffer)'        | Job-OOTPKI | .NET 8.0  | False   |  7.917 ns | 0.0312 ns | 0.0292 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (shared _smallBuffer, clear)' | Job-OOTPKI | .NET 8.0  | False   |  8.587 ns | 0.0666 ns | 0.0556 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (rent always)'                | Job-OOTPKI | .NET 8.0  | False   | 25.562 ns | 0.0598 ns | 0.0530 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (stackalloc, clear)'          | Job-OOTPKI | .NET 8.0  | False   |  6.104 ns | 0.0295 ns | 0.0276 ns |      - |         - |
| 'local_opt2 ReadGuid(int) (rent always, clear)'         | Job-OOTPKI | .NET 8.0  | False   | 28.966 ns | 0.1026 ns | 0.0960 ns |      - |         - |
| 'master ReadGuid(int)'                                  | Job-JMDAGQ | .NET 10.0 | True    | 14.074 ns | 0.0945 ns | 0.0838 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int)'                              | Job-JMDAGQ | .NET 10.0 | True    | 15.599 ns | 0.0758 ns | 0.0672 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (shared _smallBuffer)'        | Job-JMDAGQ | .NET 10.0 | True    | 14.218 ns | 0.0701 ns | 0.0656 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (shared _smallBuffer, clear)' | Job-JMDAGQ | .NET 10.0 | True    | 15.493 ns | 0.1270 ns | 0.1126 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (rent always)'                | Job-JMDAGQ | .NET 10.0 | True    | 14.905 ns | 0.1333 ns | 0.1113 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (stackalloc, clear)'          | Job-JMDAGQ | .NET 10.0 | True    | 16.223 ns | 0.0889 ns | 0.0788 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (rent always, clear)'         | Job-JMDAGQ | .NET 10.0 | True    | 16.189 ns | 0.0673 ns | 0.0629 ns | 0.0048 |      40 B |
| 'master ReadGuid(int)'                                  | Job-OOTPKI | .NET 8.0  | True    | 15.159 ns | 0.0828 ns | 0.0774 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int)'                              | Job-OOTPKI | .NET 8.0  | True    | 14.800 ns | 0.0850 ns | 0.0795 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (shared _smallBuffer)'        | Job-OOTPKI | .NET 8.0  | True    | 14.694 ns | 0.0633 ns | 0.0561 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (shared _smallBuffer, clear)' | Job-OOTPKI | .NET 8.0  | True    | 16.649 ns | 0.1274 ns | 0.1063 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (rent always)'                | Job-OOTPKI | .NET 8.0  | True    | 15.557 ns | 0.0964 ns | 0.0902 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (stackalloc, clear)'          | Job-OOTPKI | .NET 8.0  | True    | 16.606 ns | 0.1043 ns | 0.0924 ns | 0.0048 |      40 B |
| 'local_opt2 ReadGuid(int) (rent always, clear)'         | Job-OOTPKI | .NET 8.0  | True    | 19.030 ns | 0.0784 ns | 0.0695 ns | 0.0048 |      40 B |

| Method                                                    | Job        | Toolchain | Length | Mean        | Error     | StdDev    | Median      | Gen0   | Allocated |
|---------------------------------------------------------- |----------- |---------- |------- |------------:|----------:|----------:|------------:|-------:|----------:|
| 'master ReadString(Charset,int)'                          | Job-JMDAGQ | .NET 10.0 | 16     |    19.30 ns |  0.262 ns |  0.232 ns |    19.27 ns | 0.0115 |      96 B |
| 'local_opt2 ReadString(Charset,int)'                      | Job-JMDAGQ | .NET 10.0 | 16     |    20.36 ns |  0.298 ns |  0.264 ns |    20.42 ns | 0.0067 |      56 B |
| 'local_opt2 ReadString(Charset,int) (rent always)'        | Job-JMDAGQ | .NET 10.0 | 16     |    26.13 ns |  0.375 ns |  0.332 ns |    26.13 ns | 0.0067 |      56 B |
| 'local_opt2 ReadString(Charset,int) (stackalloc, clear)'  | Job-JMDAGQ | .NET 10.0 | 16     |    21.65 ns |  0.383 ns |  0.320 ns |    21.72 ns | 0.0067 |      56 B |
| 'local_opt2 ReadString(Charset,int) (rent always, clear)' | Job-JMDAGQ | .NET 10.0 | 16     |    32.56 ns |  0.681 ns |  0.976 ns |    32.10 ns | 0.0067 |      56 B |
| 'master ReadString(Charset,int)'                          | Job-OOTPKI | .NET 8.0  | 16     |    27.17 ns |  0.591 ns |  0.494 ns |    27.19 ns | 0.0115 |      96 B |
| 'local_opt2 ReadString(Charset,int)'                      | Job-OOTPKI | .NET 8.0  | 16     |    19.66 ns |  0.335 ns |  0.297 ns |    19.67 ns | 0.0067 |      56 B |
| 'local_opt2 ReadString(Charset,int) (rent always)'        | Job-OOTPKI | .NET 8.0  | 16     |    45.79 ns |  0.136 ns |  0.127 ns |    45.79 ns | 0.0067 |      56 B |
| 'local_opt2 ReadString(Charset,int) (stackalloc, clear)'  | Job-OOTPKI | .NET 8.0  | 16     |    21.78 ns |  0.306 ns |  0.286 ns |    21.84 ns | 0.0067 |      56 B |
| 'local_opt2 ReadString(Charset,int) (rent always, clear)' | Job-OOTPKI | .NET 8.0  | 16     |    43.31 ns |  0.267 ns |  0.237 ns |    43.28 ns | 0.0067 |      56 B |
| 'master ReadString(Charset,int)'                          | Job-JMDAGQ | .NET 10.0 | 8192   | 1,219.05 ns | 24.305 ns | 70.124 ns | 1,229.67 ns | 2.9354 |   24624 B |
| 'local_opt2 ReadString(Charset,int)'                      | Job-JMDAGQ | .NET 10.0 | 8192   |   851.78 ns | 16.776 ns | 25.110 ns |   841.90 ns | 1.9569 |   16408 B |
| 'local_opt2 ReadString(Charset,int) (rent always)'        | Job-JMDAGQ | .NET 10.0 | 8192   |   821.44 ns | 15.220 ns | 32.762 ns |   815.46 ns | 1.9569 |   16408 B |
| 'local_opt2 ReadString(Charset,int) (stackalloc, clear)'  | Job-JMDAGQ | .NET 10.0 | 8192   |   901.15 ns |  8.380 ns |  7.429 ns |   901.89 ns | 1.9569 |   16408 B |
| 'local_opt2 ReadString(Charset,int) (rent always, clear)' | Job-JMDAGQ | .NET 10.0 | 8192   |   897.47 ns |  6.985 ns |  6.192 ns |   899.46 ns | 1.9569 |   16408 B |
| 'master ReadString(Charset,int)'                          | Job-OOTPKI | .NET 8.0  | 8192   | 1,144.40 ns | 22.209 ns | 22.807 ns | 1,142.92 ns | 2.9354 |   24624 B |
| 'local_opt2 ReadString(Charset,int)'                      | Job-OOTPKI | .NET 8.0  | 8192   |   910.72 ns | 18.811 ns | 55.465 ns |   926.57 ns | 1.9569 |   16408 B |
| 'local_opt2 ReadString(Charset,int) (rent always)'        | Job-OOTPKI | .NET 8.0  | 8192   |   968.63 ns | 12.095 ns | 11.314 ns |   964.06 ns | 1.9569 |   16408 B |
| 'local_opt2 ReadString(Charset,int) (stackalloc, clear)'  | Job-OOTPKI | .NET 8.0  | 8192   | 1,111.98 ns | 22.006 ns | 23.546 ns | 1,111.28 ns | 1.9569 |   16408 B |
| 'local_opt2 ReadString(Charset,int) (rent always, clear)' | Job-OOTPKI | .NET 8.0  | 8192   |   916.42 ns | 17.810 ns | 16.659 ns |   910.25 ns | 1.9569 |   16408 B |

| Method                                        | Job        | Toolchain | Value | Mean       | Error     | StdDev    | Gen0   | Allocated |
|---------------------------------------------- |----------- |---------- |------ |-----------:|----------:|----------:|-------:|----------:|
| 'master Write(bool) (alloc)'                  | Job-JMDAGQ | .NET 10.0 | True  |  1.0387 ns | 0.0074 ns | 0.0066 ns |      - |         - |
| 'local_opt2 Write(bool) (stackalloc)'         | Job-JMDAGQ | .NET 10.0 | True  |  0.9071 ns | 0.0183 ns | 0.0153 ns |      - |         - |
| 'local_opt2 Write(bool) (rent always)'        | Job-JMDAGQ | .NET 10.0 | True  | 10.9253 ns | 0.0387 ns | 0.0343 ns |      - |         - |
| 'local_opt2 Write(bool) (stackalloc, clear)'  | Job-JMDAGQ | .NET 10.0 | True  |  1.1011 ns | 0.0176 ns | 0.0156 ns |      - |         - |
| 'local_opt2 Write(bool) (rent always, clear)' | Job-JMDAGQ | .NET 10.0 | True  | 10.8589 ns | 0.0415 ns | 0.0388 ns |      - |         - |
| 'master Write(bool) (alloc)'                  | Job-OOTPKI | .NET 8.0  | True  |  5.3841 ns | 0.1244 ns | 0.1222 ns | 0.0038 |      32 B |
| 'local_opt2 Write(bool) (stackalloc)'         | Job-OOTPKI | .NET 8.0  | True  |  0.8540 ns | 0.0097 ns | 0.0091 ns |      - |         - |
| 'local_opt2 Write(bool) (rent always)'        | Job-OOTPKI | .NET 8.0  | True  | 23.8173 ns | 0.1018 ns | 0.0902 ns |      - |         - |
| 'local_opt2 Write(bool) (stackalloc, clear)'  | Job-OOTPKI | .NET 8.0  | True  |  1.8278 ns | 0.0159 ns | 0.0133 ns |      - |         - |
| 'local_opt2 Write(bool) (rent always, clear)' | Job-OOTPKI | .NET 8.0  | True  | 21.2236 ns | 0.0771 ns | 0.0683 ns |      - |         - |

| Method                                            | Job        | Toolchain | Varying | Mean      | Error     | StdDev    | Gen0   | Allocated |
|-------------------------------------------------- |----------- |---------- |-------- |----------:|----------:|----------:|-------:|----------:|
| 'master Write(Guid,int) (alloc)'                  | Job-JMDAGQ | .NET 10.0 | False   | 13.794 ns | 0.1303 ns | 0.1155 ns | 0.0048 |      40 B |
| 'local_opt2 Write(Guid,int) (stackalloc)'         | Job-JMDAGQ | .NET 10.0 | False   |  6.506 ns | 0.0147 ns | 0.0130 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (rent always)'        | Job-JMDAGQ | .NET 10.0 | False   | 13.773 ns | 0.0432 ns | 0.0404 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (stackalloc, clear)'  | Job-JMDAGQ | .NET 10.0 | False   |  6.782 ns | 0.0257 ns | 0.0241 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (rent always, clear)' | Job-JMDAGQ | .NET 10.0 | False   | 15.226 ns | 0.0611 ns | 0.0510 ns |      - |         - |
| 'master Write(Guid,int) (alloc)'                  | Job-OOTPKI | .NET 8.0  | False   | 22.003 ns | 0.1068 ns | 0.0947 ns | 0.0210 |     176 B |
| 'local_opt2 Write(Guid,int) (stackalloc)'         | Job-OOTPKI | .NET 8.0  | False   |  6.429 ns | 0.0346 ns | 0.0289 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (rent always)'        | Job-OOTPKI | .NET 8.0  | False   | 30.266 ns | 0.1391 ns | 0.1302 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (stackalloc, clear)'  | Job-OOTPKI | .NET 8.0  | False   |  6.795 ns | 0.0281 ns | 0.0219 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (rent always, clear)' | Job-OOTPKI | .NET 8.0  | False   | 28.376 ns | 0.1164 ns | 0.1032 ns |      - |         - |
| 'master Write(Guid,int) (alloc)'                  | Job-JMDAGQ | .NET 10.0 | True    | 14.546 ns | 0.1021 ns | 0.0955 ns | 0.0048 |      40 B |
| 'local_opt2 Write(Guid,int) (stackalloc)'         | Job-JMDAGQ | .NET 10.0 | True    |  6.640 ns | 0.0281 ns | 0.0235 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (rent always)'        | Job-JMDAGQ | .NET 10.0 | True    | 14.207 ns | 0.0667 ns | 0.0624 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (stackalloc, clear)'  | Job-JMDAGQ | .NET 10.0 | True    |  7.135 ns | 0.0261 ns | 0.0204 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (rent always, clear)' | Job-JMDAGQ | .NET 10.0 | True    | 16.790 ns | 0.0446 ns | 0.0417 ns |      - |         - |
| 'master Write(Guid,int) (alloc)'                  | Job-OOTPKI | .NET 8.0  | True    | 27.546 ns | 0.1983 ns | 0.1758 ns | 0.0249 |     208 B |
| 'local_opt2 Write(Guid,int) (stackalloc)'         | Job-OOTPKI | .NET 8.0  | True    |  6.801 ns | 0.0214 ns | 0.0189 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (rent always)'        | Job-OOTPKI | .NET 8.0  | True    | 31.503 ns | 0.0817 ns | 0.0725 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (stackalloc, clear)'  | Job-OOTPKI | .NET 8.0  | True    |  7.310 ns | 0.0354 ns | 0.0296 ns |      - |         - |
| 'local_opt2 Write(Guid,int) (rent always, clear)' | Job-OOTPKI | .NET 8.0  | True    | 29.880 ns | 0.1814 ns | 0.1697 ns |      - |         - |

| Method                                       | Job        | Toolchain | Value     | Mean       | Error     | StdDev    | Gen0   | Allocated |
|--------------------------------------------- |----------- |---------- |---------- |-----------:|----------:|----------:|-------:|----------:|
| 'master Write(int) (alloc)'                  | Job-JMDAGQ | .NET 10.0 | 123456789 |  0.5760 ns | 0.0117 ns | 0.0109 ns |      - |         - |
| 'local_opt2 Write(int) (stackalloc)'         | Job-JMDAGQ | .NET 10.0 | 123456789 |  0.5028 ns | 0.0266 ns | 0.0236 ns |      - |         - |
| 'local_opt2 Write(int) (rent always)'        | Job-JMDAGQ | .NET 10.0 | 123456789 | 10.5113 ns | 0.0482 ns | 0.0450 ns |      - |         - |
| 'local_opt2 Write(int) (stackalloc, clear)'  | Job-JMDAGQ | .NET 10.0 | 123456789 |  0.6139 ns | 0.0076 ns | 0.0071 ns |      - |         - |
| 'local_opt2 Write(int) (rent always, clear)' | Job-JMDAGQ | .NET 10.0 | 123456789 | 10.8079 ns | 0.0564 ns | 0.0471 ns |      - |         - |
| 'master Write(int) (alloc)'                  | Job-OOTPKI | .NET 8.0  | 123456789 |  4.4407 ns | 0.0798 ns | 0.0666 ns | 0.0038 |      32 B |
| 'local_opt2 Write(int) (stackalloc)'         | Job-OOTPKI | .NET 8.0  | 123456789 |  1.9084 ns | 0.0153 ns | 0.0127 ns |      - |         - |
| 'local_opt2 Write(int) (rent always)'        | Job-OOTPKI | .NET 8.0  | 123456789 | 23.3573 ns | 0.0798 ns | 0.0707 ns |      - |         - |
| 'local_opt2 Write(int) (stackalloc, clear)'  | Job-OOTPKI | .NET 8.0  | 123456789 |  1.9451 ns | 0.0060 ns | 0.0053 ns |      - |         - |
| 'local_opt2 Write(int) (rent always, clear)' | Job-OOTPKI | .NET 8.0  | 123456789 | 20.7482 ns | 0.0902 ns | 0.0844 ns |      - |         - |

| Method                                              | Job        | Toolchain | CharLength | Mean       | Error      | StdDev     | Median     | Gen0   | Gen1   | Allocated |
|---------------------------------------------------- |----------- |---------- |----------- |-----------:|-----------:|-----------:|-----------:|-------:|-------:|----------:|
| 'master Write(string) (GetBytes alloc)'             | Job-JMDAGQ | .NET 10.0 | 16         |  11.972 ns |  0.1377 ns |  0.1150 ns |  11.970 ns | 0.0048 |      - |      40 B |
| 'local_opt2 Write(string) (stackalloc/rent)'        | Job-JMDAGQ | .NET 10.0 | 16         |   8.717 ns |  0.0308 ns |  0.0288 ns |   8.721 ns |      - |      - |         - |
| 'local_opt2 Write(string) (rent always)'            | Job-JMDAGQ | .NET 10.0 | 16         |  13.509 ns |  0.0772 ns |  0.0684 ns |  13.496 ns |      - |      - |         - |
| 'local_opt2 Write(string) (stackalloc/rent, clear)' | Job-JMDAGQ | .NET 10.0 | 16         |  10.160 ns |  0.0295 ns |  0.0276 ns |  10.158 ns |      - |      - |         - |
| 'local_opt2 Write(string) (rent always, clear)'     | Job-JMDAGQ | .NET 10.0 | 16         |  15.498 ns |  0.0572 ns |  0.0507 ns |  15.502 ns |      - |      - |         - |
| 'master Write(string) (GetBytes alloc)'             | Job-OOTPKI | .NET 8.0  | 16         |  15.841 ns |  0.2437 ns |  0.2161 ns |  15.838 ns | 0.0086 |      - |      72 B |
| 'local_opt2 Write(string) (stackalloc/rent)'        | Job-OOTPKI | .NET 8.0  | 16         |  10.893 ns |  0.0682 ns |  0.0605 ns |  10.894 ns |      - |      - |         - |
| 'local_opt2 Write(string) (rent always)'            | Job-OOTPKI | .NET 8.0  | 16         |  32.174 ns |  0.1287 ns |  0.1204 ns |  32.159 ns |      - |      - |         - |
| 'local_opt2 Write(string) (stackalloc/rent, clear)' | Job-OOTPKI | .NET 8.0  | 16         |  13.041 ns |  0.0288 ns |  0.0241 ns |  13.046 ns |      - |      - |         - |
| 'local_opt2 Write(string) (rent always, clear)'     | Job-OOTPKI | .NET 8.0  | 16         |  33.083 ns |  0.2114 ns |  0.1765 ns |  33.066 ns |      - |      - |         - |
| 'master Write(string) (GetBytes alloc)'             | Job-JMDAGQ | .NET 10.0 | 8192       | 482.449 ns |  9.6229 ns | 25.1815 ns | 471.592 ns | 0.9813 |      - |    8216 B |
| 'local_opt2 Write(string) (stackalloc/rent)'        | Job-JMDAGQ | .NET 10.0 | 8192       | 164.660 ns |  0.6100 ns |  0.5408 ns | 164.541 ns |      - |      - |         - |
| 'local_opt2 Write(string) (rent always)'            | Job-JMDAGQ | .NET 10.0 | 8192       | 163.243 ns |  0.3665 ns |  0.3428 ns | 163.237 ns |      - |      - |         - |
| 'local_opt2 Write(string) (stackalloc/rent, clear)' | Job-JMDAGQ | .NET 10.0 | 8192       | 576.081 ns |  3.4071 ns |  3.1870 ns | 576.369 ns |      - |      - |         - |
| 'local_opt2 Write(string) (rent always, clear)'     | Job-JMDAGQ | .NET 10.0 | 8192       | 564.724 ns |  2.1457 ns |  2.0071 ns | 564.476 ns |      - |      - |         - |
| 'master Write(string) (GetBytes alloc)'             | Job-OOTPKI | .NET 8.0  | 8192       | 547.055 ns | 10.2174 ns | 23.4761 ns | 535.135 ns | 0.9851 | 0.0305 |    8248 B |
| 'local_opt2 Write(string) (stackalloc/rent)'        | Job-OOTPKI | .NET 8.0  | 8192       | 204.426 ns |  0.6204 ns |  0.5500 ns | 204.526 ns |      - |      - |         - |
| 'local_opt2 Write(string) (rent always)'            | Job-OOTPKI | .NET 8.0  | 8192       | 197.471 ns |  0.5817 ns |  0.5441 ns | 197.474 ns |      - |      - |         - |
| 'local_opt2 Write(string) (stackalloc/rent, clear)' | Job-OOTPKI | .NET 8.0  | 8192       | 632.919 ns |  2.7815 ns |  2.6018 ns | 633.331 ns |      - |      - |         - |
| 'local_opt2 Write(string) (rent always, clear)'     | Job-OOTPKI | .NET 8.0  | 8192       | 631.340 ns |  1.6094 ns |  1.4267 ns | 631.268 ns |      - |      - |         - |

These benchmarks compare several buffer-handling strategies: the original code (master), stackalloc by default (where available), the preallocated _smallBuffer (for data of 16 bytes or less), and renting from ArrayPool. The "clear" variants additionally erase the data from the buffer after each operation, for security.

The general speed ordering, fastest first, is:
stackalloc > _smallBuffer > new byte[] > ArrayPool (for small types)
and
stackalloc > ArrayPool > new byte[] (for large types)

However, in repeated operations only the new byte[] variants allocate a new buffer on every call; the rest are allocation-free (apart from the first ArrayPool rent in each size category).
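To make the compared strategies concrete, here is a minimal sketch of three of them for reading a big-endian (XDR-style) Int32 from a stream. The class and method names are hypothetical and this is not the provider's actual XdrReaderWriter code; it assumes .NET 7 or later for `Stream.ReadExactly`.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.IO;

// Hypothetical illustration only, not the provider's real reader.
public static class Int32ReadSketch
{
    // Strategy 1: scratch buffer on the stack, no heap allocation.
    public static int ReadWithStackalloc(Stream stream)
    {
        Span<byte> buffer = stackalloc byte[4];
        stream.ReadExactly(buffer);
        return BinaryPrimitives.ReadInt32BigEndian(buffer);
    }

    // Strategy 2: rent from the shared pool. No per-call allocation once the
    // pool is warm, but Rent/Return add a fixed overhead per call.
    public static int ReadWithArrayPool(Stream stream)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(4); // may be longer than 4
        try
        {
            stream.ReadExactly(rented, 0, 4);
            return BinaryPrimitives.ReadInt32BigEndian(rented.AsSpan(0, 4));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Strategy 3: a fresh array per call, which is the variant that allocates.
    public static int ReadWithNewArray(Stream stream)
    {
        var buffer = new byte[4];
        stream.ReadExactly(buffer, 0, 4);
        return BinaryPrimitives.ReadInt32BigEndian(buffer);
    }
}
```

All three return the same value for the same input; they differ only in where the 4-byte scratch buffer lives.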

Synchronous operations can use stackalloc for the IO buffer, whereas async operations have to use a heap-allocated buffer. Synchronous operations therefore benefit regardless of data size. Async operations still benefit from reduced allocations but pay slightly more CPU time for renting small buffers, though in practice this shouldn't be measurable next to actual IO, networking and the db engine.
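The sync/async difference comes from the fact that a stack-allocated `Span<byte>` cannot live across an `await`, so an async read needs a buffer on the heap. A minimal sketch of the rented-buffer approach follows; the names are hypothetical and it assumes .NET 7 or later for `Stream.ReadExactlyAsync`.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration only, not the provider's real reader.
public static class AsyncReadSketch
{
    public static async ValueTask<int> ReadInt32Async(
        Stream stream, CancellationToken cancellationToken = default)
    {
        // The buffer must survive the await, so it is rented, not stackalloc'ed.
        byte[] rented = ArrayPool<byte>.Shared.Rent(4);
        try
        {
            await stream.ReadExactlyAsync(rented.AsMemory(0, 4), cancellationToken)
                        .ConfigureAwait(false);
            return BinaryPrimitives.ReadInt32BigEndian(rented.AsSpan(0, 4));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```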

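There is also potential for clearing data from buffers after use as a hypothetical security improvement. For fixed-size types the cost is small (e.g. ReadGuid with stackalloc on .NET 10: 5.502 ns vs 6.005 ns with clear), but in the Write(string) benchmark at 8192 chars the clear variants are about 3 to 3.5x slower (e.g. 164.660 ns vs 576.081 ns on .NET 10).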
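As an illustration of what a "clear" variant might look like, the sketch below encodes a string into a stackalloc or rented buffer, writes it, and then zeroes the used part of the buffer before releasing it. The names and the 256-byte threshold are assumptions made for this example, not values taken from the PR.

```csharp
#nullable enable
using System;
using System.Buffers;
using System.IO;
using System.Text;

// Hypothetical illustration only, not the provider's real writer.
public static class ClearingWriteSketch
{
    // Assumed cut-off between stack and pool; the PR may use a different value.
    private const int StackallocThreshold = 256;

    public static void WriteString(Stream stream, string value)
    {
        int maxBytes = Encoding.UTF8.GetMaxByteCount(value.Length);
        byte[]? rented = null;
        Span<byte> buffer = maxBytes <= StackallocThreshold
            ? stackalloc byte[StackallocThreshold]
            : (rented = ArrayPool<byte>.Shared.Rent(maxBytes));
        int written = 0;
        try
        {
            written = Encoding.UTF8.GetBytes(value.AsSpan(), buffer);
            stream.Write(buffer.Slice(0, written));
        }
        finally
        {
            // Zero only the bytes that were reported as written.
            buffer.Slice(0, written).Clear();
            if (rented is not null)
            {
                ArrayPool<byte>.Shared.Return(rented);
            }
        }
    }
}
```

An alternative for the rented case is `ArrayPool<byte>.Shared.Return(rented, clearArray: true)`, which zeroes the whole array, including any part that was never used.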

The benchmarks were run on the two current main .NET versions (8 and 10). In some cases .NET 10's escape analysis improvements avoid the heap allocation for array creation (see the bool and int benchmarks), but stackalloc is still faster regardless of version. .NET 10 also reduces the performance impact of using ArrayPool. The only thing left to try is switching from the global shared ArrayPool to a static pool dedicated to XdrReaderWriter.
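For reference, a dedicated static pool could be set up roughly as below using `ArrayPool<byte>.Create`. The class name and both limits are assumptions chosen for illustration. Whether this beats `ArrayPool<byte>.Shared` would need measuring, since the two pool implementations handle thread contention differently.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.IO;

// Hypothetical illustration only, not the provider's real writer.
public static class DedicatedPoolSketch
{
    // Assumed limits; suitable values would depend on the real workload.
    private static readonly ArrayPool<byte> Pool =
        ArrayPool<byte>.Create(maxArrayLength: 64 * 1024, maxArraysPerBucket: 16);

    public static void WriteInt32(Stream stream, int value)
    {
        byte[] rented = Pool.Rent(4); // the pool may hand back a longer array
        try
        {
            BinaryPrimitives.WriteInt32BigEndian(rented, value);
            stream.Write(rented, 0, 4);
        }
        finally
        {
            // Arrays must go back to the same pool they were rented from.
            Pool.Return(rented);
        }
    }
}
```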

The benchmark source is available in the perf project on the local_opt2_benchmarks branch of my fork.
